2. Exploring a subdataset

Since this dataset is entirely real, it may contain a large amount of noise in its attributes. To figure out how to clean it, we can work with just one month, such as August, the month with the fewest trips: only 12,597,109. After that, the process for the whole year will be automated in the next notebook.

2.1. Preparing the notebook


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set(font='sans')

In [3]:
# This auxiliary function applies another function to every row of a DataFrame,
# in chunks of 100,000 rows, to fill in new columns.
def iterate_and_apply(dataframe, function, necesary_columns):
    perform = True
    step = 100000
    start = 0
    to = step
    
    while perform:
        # Apply the function to the next chunk of rows.
        new_columns = dataframe[start:to][necesary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            # No rows left, so stop iterating.
            perform = False
        else:
            # Write the freshly computed columns back into the DataFrame.
            dataframe.update(new_columns)
        new_columns = None
        start += step
        to += step
    
    return dataframe
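
As a quick illustration of the helper's contract (a hypothetical toy example, not a cell from this notebook): the target column must already exist in the DataFrame, and the applied function must return a Series keyed by the columns to fill, so that update can write the values in place.

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
toy['double_a'] = np.nan  # The target column must exist before calling the helper.
toy = iterate_and_apply(toy, lambda row: pd.Series({'double_a': row.a * 2}), ['a'])
print(toy.double_a.tolist())  # [2.0, 4.0, 6.0]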

2.2. Cleaning August fare data


In [4]:
data = pd.read_csv('../data/faredata/trip_fare_8.csv', skipinitialspace=True)

Let's take a look at the data. Each row of the fare CSV file represents a trip and the columns are attributes of those trips. In the following table we can check the values of these attributes for the first five trips of the file.


In [5]:
data.head()


Out[5]:
medallion hack_license vendor_id pickup_datetime payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
0 3418135604CD3F357DD9577AF978C5C0 B25386A1F259C87449430593E904FDBC VTS 2013-08-30 07:57:00 CSH 41.5 0.0 0.5 0 0.00 42.00
1 6D3B2A7682C30DCF64F3F12976EF93B6 A603A9D5FAA46E8FF2A97A143328D938 CMT 2013-08-30 23:26:23 CSH 31.0 0.5 0.5 0 5.33 37.33
2 6D49E494913752B75B2685E0019FF3D5 3F0BFE90A5D71741840B25600A89E225 CMT 2013-08-30 09:18:10 CSH 5.5 0.0 0.5 0 0.00 6.00
3 4C4A0AFC432A1A87E97ED8F18403FF6E BA20A20E2CF85EF7B00162D711394C7E CMT 2013-08-26 23:27:11 CSH 23.0 0.5 0.5 0 5.33 29.33
4 1258CA1DF5E2A9E9A9F7848408A7AAEB 8C14DCF69CAA2A9A0DFAFD99E00536A1 CMT 2013-08-29 10:57:56 CSH 14.0 0.0 0.5 0 0.00 14.50

Of all of them, the first three columns may sound a bit strange to us. Let's check them:

  • medallion: Hash of the taxi identifier.
  • hack_license: Hash of the taxi driver's license.
  • vendor_id: Identifier of the brand of the device that collected this information.

Knowing that, we can simply skip checking the first, second and fourth attributes. So, let's see how the values of vendor_id are distributed across the trips.


In [6]:
ax = data.groupby(['vendor_id']).size().plot(kind='bar')

ax.set_xlabel('vendor_id', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


It looks like we don't need to treat these values; they are pretty well distributed.

Let's check the payment_type.


In [7]:
ax = data.groupby(['payment_type']).size().plot(kind='bar')

ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


Here we can see that card and cash are the main taxi payment types. The rows with the other, stranger values can be deleted. By doing this (and we are going to do the same with the other attributes) we are adding a bit of bias to the predictions, but those values are so unusual that they will hardly affect the predictions' performance.

As the rest of the attributes are numeric, a good way to inspect them is to compute a few summary statistics.


In [8]:
data.describe()


Out[8]:
fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
count 12597109.000000 12597109.000000 12597109.000000 12597109.000000 12597109.000000 12597109.000000
mean 12.569788 0.331081 0.497757 1.359659 0.275257 15.131272
std 54.536458 0.800972 0.073919 2.351744 1.499754 252.274239
min -1430.000000 -19.500000 -0.500000 -96.820000 -22.250000 -1430.000000
25% 6.500000 0.000000 0.500000 0.000000 0.000000 8.000000
50% 9.500000 0.500000 0.500000 1.000000 0.000000 11.300000
75% 14.000000 0.500000 0.500000 2.000000 0.000000 16.700000
max 158995.810000 854.500000 80.050000 888.190000 960.090000 685908.100000

fare_amount is an attribute that is difficult to visualize, so we can look at the previous table instead. There we can see that there are negative values! So, thinking about a range of ordinary values for this attribute, something between \$3.00 and \$200.00 comes to my mind.
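
A quick way to sanity-check such a range (a sketch; the exact cut-offs remain a judgment call) is to look at extreme quantiles and at the fraction of trips the range would discard:

# Sketch: extreme quantiles of the fare, and the fraction of trips outside $3.00-$200.00.
print(data.fare_amount.quantile([0.001, 0.999]))
print((~data.fare_amount.between(3.0, 200.0)).mean())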

Let's obtain a plot for the surcharge values.


In [9]:
ax = data.groupby(['surcharge']).size().plot(kind='bar', figsize=(15, 5))

ax.set_xlabel('surcharge', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


What are we going to do here? Of course, keep only the trips with \$0.00, \$0.50 and \$1.00 values.

For mta_tax we obtain the following plot.


In [10]:
ax = data.groupby(['mta_tax']).size().plot(kind='bar')

ax.set_xlabel('mta_tax', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


So, something like the previous case, but here we are going to keep only the trips with a \$0.50 value.

We are going to skip tip_amount for the moment.

Obtaining a useful representation of tolls_amount is very difficult because of the huge range of values. The cause may be that these values are introduced manually. Also, toll prices can change over the course of the year, so they probably won't be the same in every month.

A solution might be to look only at the values that are repeated, for example, more than a thousand times.


In [11]:
tolls = data.groupby(['tolls_amount']).size()

print(tolls[tolls >= 1000.0])
tolls = None


tolls_amount
0.00            11985089
2.44                5045
5.33              569389
7.50                1283
8.25                5517
10.25              12782
10.66               3542
15.58               1095
dtype: int64

We can say that a good range for this value may be something like \$0.00 to \$30.00.

We aren't going to take the total_amount attribute into account. It is the sum of the previous attributes, so errors in it are unlikely. Also, we aren't going to use it for the learning algorithm, because the tip is implicitly included in it.
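
If we wanted to verify that total_amount really is the sum of the other monetary attributes instead of assuming it, a one-line check (a sketch) would be:

# Sketch: largest absolute difference between total_amount and the sum of its parts.
components = data.fare_amount + data.surcharge + data.mta_tax + data.tip_amount + data.tolls_amount
print((data.total_amount - components).abs().max())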

Let's clean all of these attributes!


In [12]:
payment_type = ((data.payment_type == 'CRD') | (data.payment_type == 'CSH'))
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))

# Let's save it in another variable, I have a hunch.
data_aux = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]

payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None

A way of resolving the hunch I mentioned in the previous piece of code is to plot the payment_type attribute again.


In [13]:
ax = data_aux.groupby(['payment_type']).size().plot(kind='bar')

ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


And once more, this time showing only the trips with a tip greater than \$0.00.


In [14]:
ax = data_aux[(data_aux.tip_amount > 0)].groupby(['payment_type']).size().plot(kind='bar')

ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


What is happening here? We all know :( Tips paid in cash are almost never recorded by the system, so nearly every cash trip shows a \$0.00 tip.

So, to remove this annoying noise, let's delete the cash payment type. We can also delete the payment_type attribute afterwards, because it will only have one distinct value left.


In [15]:
data_aux = None

payment_type = (data.payment_type == 'CRD')
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))

data = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]

payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None

data.drop(['payment_type'], axis=1, inplace=True)

2.3. Calculating the tip percentage

To obtain a better value to predict, we can compute a normalized version of the tip: the tip percentage. We can use the following formula:

$$ tip\_perc = \frac{tip\_amount}{fare\_amount + surcharge + mta\_tax} \cdot 100 $$
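
For example, a trip with a \$10.00 fare, a \$0.50 surcharge and the \$0.50 mta_tax that received a \$2.00 tip gets a tip percentage of 2.00 / 11.00 · 100 ≈ 18.18%.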

In [16]:
tip_perc_necesary_columns = ['fare_amount', 'surcharge', 'mta_tax', 'tip_amount']
tip_perc_column_name = 'tip_perc'

def calculate_tip_perc(row):
    subtotal = row.fare_amount + row.surcharge + row.mta_tax
    tip = row.tip_amount / subtotal
    tip_perc = tip * 100
    
    return pd.Series({tip_perc_column_name: tip_perc})

In [17]:
data[tip_perc_column_name] = np.nan

data = iterate_and_apply(data, calculate_tip_perc, tip_perc_necesary_columns)

# Now that the 'mta_tax' attribute has been used it can be deleted because it only has one value.
data.drop(['mta_tax'], axis=1, inplace=True)

Let's take a look at the summary statistics of the new attribute.


In [18]:
data.describe()


Out[18]:
fare_amount surcharge tip_amount tolls_amount total_amount tip_perc
count 6697963.000000 6697963.000000 6697963.000000 6697963.000000 6697963.000000 6697963.000000
mean 13.065660 0.337271 2.493890 0.303043 16.699864 18.186398
std 9.850364 0.362734 2.249244 1.258718 12.447595 9.530609
min 3.000000 0.000000 0.000000 0.000000 3.500000 0.000000
25% 7.000000 0.000000 1.200000 0.000000 9.500000 14.285714
50% 10.000000 0.500000 2.000000 0.000000 12.600000 19.354839
75% 15.000000 0.500000 2.900000 0.000000 18.600000 20.000000
max 200.000000 1.000000 100.000000 30.000000 260.500000 2000.000000

A 2,000% tip! This percentage should be restricted to a more ordinary range, something like 0% to 50%.


In [19]:
tip_perc = (data.tip_perc <= 50.0)
data = data[tip_perc]

tip_perc = None

2.4. Adding the trip data file

The file that contains the physical information of each trip cannot be read directly due to errors in one of its columns. Let's go over all of them, explaining the ones that need it:

  • medallion
  • hack_license
  • vendor_id
  • rate_code: We can get more information about this attribute here.
  • store_and_fwd_flag: This is the column with the errors. A solution to this issue may be to ignore its values.
  • pickup_datetime
  • dropoff_datetime: We don't need this one; its information is implicit in pickup_datetime and trip_time_in_secs.
  • passenger_count
  • trip_time_in_secs
  • trip_distance
  • pickup_latitude
  • pickup_longitude
  • dropoff_latitude
  • dropoff_longitude

So, we are going to read only the valid and new columns: rate_code plus the attributes from passenger_count to dropoff_latitude, which correspond to the column indices 3 and 7 to 13 used below.


In [20]:
tripdata = pd.read_csv('../data/tripdata/trip_data_8.csv', skipinitialspace=True, usecols=[3, 7, 8, 9, 10, 11, 12, 13])

To join this new DataFrame with the fare one, we are going to use their indices as... an index! We can do it that way because the trips appear in the same order in both files.


In [21]:
for column in tripdata.columns:
    data[column] = np.nan

'''
Careful! The next line uses a lot of memory, at least more than 8GB.
It's shown here because it's simpler than the other way I know to update the 'fare' DataFrame:
    read the 'trip' file in chunks to create a lot of little DataFrames and update them
    into the data variable in a loop.
'''
data.update(tripdata)
tripdata = None
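
For reference, here is a sketch of that chunked alternative (an assumption about how it could look, not a cell that was actually run; the NaN columns created in the previous loop are still required, because update only fills columns that already exist):

# Sketch of the lower-memory alternative: read the trip file in chunks and
# update the fare DataFrame piece by piece instead of loading it all at once.
reader = pd.read_csv('../data/tripdata/trip_data_8.csv', skipinitialspace=True,
                     usecols=[3, 7, 8, 9, 10, 11, 12, 13], chunksize=100000)
for chunk in reader:
    data.update(chunk)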

2.5. Cleaning August trip data

So, let's start checking the new attributes. The first of them is rate_code.


In [22]:
ax = data.groupby('rate_code').size().plot(kind='bar')

ax.set_xlabel('rate_code', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


We should keep only the rows with 1 as the value of this attribute. Afterwards, we can delete the attribute, because it will have only one distinct value.

The next attribute to check is passenger_count.


In [23]:
ax = data.groupby('passenger_count').size().plot(kind='bar')

ax.set_xlabel('passenger_count', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


A usual trip carries 1 to 6 passengers, so we can discard the rest.

A couple of attributes that are closely related are trip_time_in_secs and trip_distance. Let's check out the following table with the summary statistics of all the attributes.


In [24]:
data.describe()


Out[24]:
fare_amount surcharge tip_amount tolls_amount total_amount tip_perc rate_code passenger_count trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
count 6679377.000000 6679377.00000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000 6679377.000000
mean 13.073027 0.33731 2.470901 0.303073 16.684310 17.959636 1.026901 1.667426 1421.444612 82.635783 -73.210150 40.326279 -73.201282 40.322172
std 9.848450 0.36280 2.113576 1.258079 12.411177 6.497662 0.328381 1.355291 51981.318368 25563.660419 7.535068 4.521099 7.597918 4.625446
min 3.000000 0.00000 0.000000 0.000000 3.500000 0.000000 0.000000 0.000000 -1.000000 0.000000 -1958.327000 -3113.289800 -2216.325200 -3114.304900
25% 7.000000 0.00000 1.200000 0.000000 9.500000 14.285714 1.000000 1.000000 420.000000 1.190000 -73.993515 40.733177 -73.992462 40.731628
50% 10.000000 0.50000 2.000000 0.000000 12.600000 19.333333 1.000000 1.000000 660.000000 2.000000 -73.982887 40.750454 -73.981445 40.750904
75% 15.000000 0.50000 2.900000 0.000000 18.600000 20.000000 1.000000 2.000000 1020.000000 3.600000 -73.969521 40.764839 -73.966156 40.765533
max 200.000000 1.00000 60.000000 30.000000 260.500000 50.000000 210.000000 8.000000 4294966.000000 15038005.000000 51.425247 473.958280 93.411148 1344.402100

Negative values, trips lasting more than 49 days (4,294,966 seconds), trips covering millions of miles. A crazy thing.

So, to fix that, we can use Google Maps and look for a long but ordinary trip, like this one: around 50 minutes to travel 21.1 miles. So, we can use a maximum of 1 hour (3,600 seconds) and 25 miles.

Continuing with the attributes, it's the turn of the coordinates: longitude and latitude for pickups and dropoffs. By observing the previous table we can notice coordinates that don't even exist! To fix that, we can keep only the coordinates satisfying the conditions in the following table.

            min          max
latitude    40.459518    41.175342
longitude   -74.361107   -71.903083

By filtering to those coordinates, we will use the area shown in the next image, extracted from Google Maps.


In [25]:
rate_code = (data.rate_code == 1.0)
passenger_count = ((data.passenger_count >= 1.0) & (data.passenger_count <= 6.0))
trip_time_in_secs = ((data.trip_time_in_secs > 0.0) & (data.trip_time_in_secs <= 3600.0))
trip_distance = ((data.trip_distance > 0.0) & (data.trip_distance <= 25.0))

pickup_latitude = ((data.pickup_latitude >= 40.459518) & (data.pickup_latitude <= 41.175342))
pickup_longitude = ((data.pickup_longitude >= -74.361107) & (data.pickup_longitude <= -71.903083))
dropoff_latitude = ((data.dropoff_latitude >= 40.459518) & (data.dropoff_latitude <= 41.175342))
dropoff_longitude = ((data.dropoff_longitude >= -74.361107) & (data.dropoff_longitude <= -71.903083))

data = data[rate_code & passenger_count & trip_time_in_secs & trip_distance & pickup_latitude & pickup_longitude
            & dropoff_latitude & dropoff_longitude]

rate_code = None
passenger_count = None
trip_time_in_secs = None
trip_distance = None

pickup_latitude = None
pickup_longitude = None
dropoff_latitude = None
dropoff_longitude = None

data.drop(['rate_code'], axis=1, inplace=True)

But what about the possible coordinates that point to the water? Let's plot the southwest area of the city.


In [26]:
num_samples = 1250000
indices = np.random.choice(data.index, num_samples)

pickup_x = data.pickup_longitude[indices].values
pickup_y = data.pickup_latitude[indices].values
dropoff_x = data.dropoff_longitude[indices].values
dropoff_y = data.dropoff_latitude[indices].values

In [27]:
sns.set_style('white')

fig, ax = plt.subplots(figsize=(11, 12))

ax.scatter(pickup_x, pickup_y, s=5, color='blue', alpha=0.5)
ax.scatter(dropoff_x, dropoff_y, s=5, color='red', alpha=0.5)

ax.set_xlim([-74.05, -74.00])
ax.set_ylim([40.70, 40.75])

ax.set_title('coordinates')



In [28]:
sns.set_style('darkgrid')

indices = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None

This is a bit strange, because this dataset doesn't include water taxi information. Just kidding.

Let's fix this issue in the next section.

2.6. Filtering the coordinates of unusual places

A solution to the coordinates-in-the-water problem could be to divide the whole area into smaller square areas. These smaller areas should have relatively short sides; let's use approximately 270 meters (885 feet) as the side length.

After creating the smaller areas, we can delete the trips whose pickup or dropoff happened in an area with fewer pickups or dropoffs than a given threshold, for example, 20.
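
Where do the step sizes used in the next cell come from? Below is a rough meters-to-degrees conversion at New York's latitude (a sketch; the notebook's constants were presumably derived this way, so small differences are expected):

import math

side_m = 270.0  # Desired side length of each square area, in meters.
lat = math.radians(40.75)  # Approximate mean latitude of the filtered area.

step_y_approx = side_m / 111130.0                    # One degree of latitude is ~111,130 m.
step_x_approx = side_m / (111320.0 * math.cos(lat))  # Longitude degrees shrink by cos(latitude).

print(step_x_approx, step_y_approx)  # ~0.0032, ~0.00243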

So, let's create these smaller areas and plot the coordinates in them.


In [29]:
# Coordinates variables.
min_x = -74.361107
min_y = 40.459518

max_x = -71.903083
max_y = 41.175342

step_x = 0.003205  # 270 meters (885 feet) approximately.
step_y = 0.002429  # 270 meters (885 feet) approximately.

# Index of the last area along each axis of the 'area matrix'.
max_area_x = 766  # int((max_x - min_x) / step_x)
max_area_y = 294  # int((max_y - min_y) / step_y)

# Number of columns of the 'area matrix'.
num_columns = 767  # max_area_x + 1

# The area at the top right corner in the 'area matrix'.
max_area = 226264  # (max_area_y * num_columns) + max_area_x

# Minimum number of pickups or dropoffs per area.
min_area_number = 20


# Columns for creating the pickup and dropoff area attributes.
area_column_names = ['pickup_area', 'dropoff_area']
area_necesary_columns = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']


# Functions for calculating the pickup and dropoff areas in the 'area matrix'.

def get_1d_area(coor, min_coor, step_coor):
    return int((coor - min_coor) / step_coor)

def get_area(area_x, area_y, columns):
    return (area_y * columns) + area_x

def calculate_area(row):
    pickup_x = get_1d_area(row.pickup_longitude, min_x, step_x)
    pickup_y = get_1d_area(row.pickup_latitude, min_y, step_y)
    pickup_area = get_area(pickup_x, pickup_y, num_columns)
    
    dropoff_x = get_1d_area(row.dropoff_longitude, min_x, step_x)
    dropoff_y = get_1d_area(row.dropoff_latitude, min_y, step_y)
    dropoff_area = get_area(dropoff_x, dropoff_y, num_columns)
    
    return pd.Series({
        area_column_names[0]: pickup_area,
        area_column_names[1]: dropoff_area
    })

In [30]:
for column in area_column_names:
    data[column] = np.nan

data = iterate_and_apply(data, calculate_area, area_necesary_columns)

In [31]:
color_list = ['red', 'cyan']

num_samples = 1250000
index_samples = np.random.choice(data.index, num_samples)

samples = data.loc[index_samples]
pickup = samples.groupby(['pickup_area']).size().index
dropoff = samples.groupby(['dropoff_area']).size().index
samples = None

areas = np.concatenate([pickup, dropoff])
areas = np.unique(areas)

In [32]:
sns.set_style('white')

fig, ax = plt.subplots(figsize=(22, 24))
for area in areas.tolist():
    if area % 2 == 0:
        color = color_list[0]
    else:
        color = color_list[1]
    data_pickup = data[(data.pickup_area == area)]
    data_dropoff = data[(data.dropoff_area == area)]
    pickup_x = data_pickup.pickup_longitude.values
    pickup_y = data_pickup.pickup_latitude.values
    dropoff_x = data_dropoff.dropoff_longitude.values
    dropoff_y = data_dropoff.dropoff_latitude.values
    ax.scatter(pickup_x, pickup_y, s=0.1, color=color, alpha=0.2)
    ax.scatter(dropoff_x, dropoff_y, s=0.1, color=color, alpha=0.2)

ax.set_xlim([-74.1, -73.7])
ax.set_ylim([40.6, 40.9])

ax.set_title('coordinates')



In [33]:
sns.set_style('darkgrid')

index_samples = None
pickup = None
dropoff = None
areas = None
data_pickup = None
data_dropoff = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None

Pretty cool plot!

So, it's time to remove the trips that happened in unusual areas. Also, we don't need these area attributes anymore.


In [34]:
# Filtering the amount of pickups in the same area.
aux_pickup = data.groupby(['pickup_area']).size()
aux_pickup = aux_pickup[aux_pickup >= min_area_number]
aux_pickup = data['pickup_area'].isin(aux_pickup.index)

# Filtering the amount of dropoffs in the same area.
aux_dropoff = data.groupby(['dropoff_area']).size()
aux_dropoff = aux_dropoff[aux_dropoff >= min_area_number]
aux_dropoff = data['dropoff_area'].isin(aux_dropoff.index)

# Using the filters.
data = data[aux_pickup & aux_dropoff]
aux_pickup = None
aux_dropoff = None

data.drop(['pickup_area', 'dropoff_area'], axis=1, inplace=True)

2.7. Saving the August file

Now we just need to save this clean month in a file. Also, it's a good idea to order the DataFrame's columns.


In [35]:
column_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'fare_amount', 'surcharge', 'tip_amount',
                'tip_perc', 'tolls_amount', 'total_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance',
                'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

data = data.reindex(columns=column_order)

In [36]:
if not os.path.exists('../data/cleaned/'):
    os.makedirs('../data/cleaned/')

data.to_csv('../data/cleaned/cleaned_8.csv', index=True)

In the next notebook, this entire process will be applied to the other months. After that, our initial dataset of more than 170 million trips in 48.6GB will be reduced to 88,156,805 trips in 17.3GB.